2021 iThome Ironman Contest

DAY 27

What is Colly?

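Colly is a web crawling framework for Golang. A web crawler is, in short, a tool that automatically collects and parses data from the web.

In this chapter we will show how to use Colly to collect data from a specific domain and site!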

Installation

go get -u github.com/gocolly/colly

How to Use Colly?

app/crawler/collier.go

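package crawler

import (
	"github.com/gocolly/colly"
	"github.com/sirupsen/logrus"
	"ironman-2021/app/middleware"
)

// Collier visits url once and logs each stage of the crawl.
func Collier(url string) {
	c := colly.NewCollector(
		colly.UserAgent("Mozilla/5.0 (compatible; Googlebot/2.1; +https://www.google.com/bot.html)"),
	)
	// Called just before the request is sent.
	// Info joins its arguments like fmt.Sprint, which adds no space next to a
	// string operand, so this is logged as "Visiting<url>".
	c.OnRequest(func(r *colly.Request) {
		middleware.Logger().WithFields(logrus.Fields{
			"name": "Collier",
		}).Info("Visiting", r.URL)
	})
	// Called when the request fails; logged at error level instead of info.
	c.OnError(func(_ *colly.Response, err error) {
		middleware.Logger().WithFields(logrus.Fields{
			"name": "Collier",
		}).Error("Visiting Failed, err: ", err)
	})
	// Called when a response arrives; writes the raw body to the log.
	c.OnResponse(func(r *colly.Response) {
		middleware.Logger().WithFields(logrus.Fields{
			"name": "Collier",
		}).Info("Visited, body: ", string(r.Body))
	})
	// Called after the other callbacks have run for this page.
	c.OnScraped(func(r *colly.Response) {
		middleware.Logger().WithFields(logrus.Fields{
			"name": "Collier",
		}).Info("Finished", r.Request.URL)
	})
	// Log the error from Visit instead of dropping it silently.
	if err := c.Visit(url); err != nil {
		middleware.Logger().WithFields(logrus.Fields{
			"name": "Collier",
		}).Error("Visit failed, err: ", err)
	}
}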
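  • First we create a Collector instance named c.
  • Next we define what to do at each stage of the crawl. Apart from OnResponse, where we write the crawl result (the response body) to the log, every other callback simply logs which stage is running.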

main.go

server.GET("/crawler", func(c *gin.Context) {
	// Run the crawl synchronously, then acknowledge the request.
	crawler.Collier("https://ithelp.ithome.com.tw/users/20129737/ironman/4014")
	c.String(http.StatusOK, "Finished Collier")
})

Finally, we add a simple GET API to the main program to trigger the crawl.
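The endpoint above answers 200 whether or not the crawl succeeded, because Collier does not return anything to the caller. The following is a minimal sketch of one way to surface failures. It assumes a hypothetical variant of Collier that returns its error (the `crawlFunc` type, `okCrawl`, `failingCrawl`, and the 502 status are illustrative choices, not part of the original project), and it uses the standard library's net/http and httptest instead of gin so that it runs on its own.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
	"net/http/httptest"
)

// crawlFunc is a hypothetical stand-in for a Collier variant that returns its error.
type crawlFunc func(url string) error

// okCrawl and failingCrawl are fake crawls used only for this illustration.
func okCrawl(string) error      { return nil }
func failingCrawl(string) error { return errors.New("timeout") }

// crawlerHandler runs the crawl and maps its result to an HTTP status.
func crawlerHandler(crawl crawlFunc, target string) http.HandlerFunc {
	return func(w http.ResponseWriter, r *http.Request) {
		if err := crawl(target); err != nil {
			http.Error(w, "crawl failed: "+err.Error(), http.StatusBadGateway)
			return
		}
		fmt.Fprint(w, "Finished Collier")
	}
}

// statusFor calls the handler once in memory and returns the recorded status code.
func statusFor(crawl crawlFunc) int {
	rec := httptest.NewRecorder()
	req := httptest.NewRequest(http.MethodGet, "/crawler", nil)
	crawlerHandler(crawl, "https://example.com")(rec, req)
	return rec.Code
}

func main() {
	fmt.Println(statusFor(okCrawl), statusFor(failingCrawl)) // prints "200 502"
}
```

With gin the same idea would live inside the `server.GET` closure: check the returned error and choose the status passed to `c.String` accordingly.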

logs/2021-10-10.log

time="101010-10-10 1010:1010:1010" level=info msg="Health CheckInfo" name="Flynn Sun"
time="101010-10-10 1010:1010:1010" level=info msg="| 200 |      5.3626ms |      172.19.0.1 | GET | /hc |"
time="101010-10-10 1010:1010:1010" level=info msg="Visitinghttps://ithelp.ithome.com.tw/articles/10279931" name=Collier
time="101010-10-10 1010:1010:1010" level=info msg="Visited, body: <!DOCTYPE html>\n<html lang=\"zh-TW\">\n\n<head>\n    <meta charset=\"utf-8\">\n<meta http-equiv=\"X-UA-Compatible\" content=\"IE=edge\">\n<meta name=\"viewport\" content=\"width=device-width, initial-scale=1\">\n\n\n<title>Day25 Gin with API Test - iT 邦幫忙::一起幫忙解決難題,拯救 IT 人的一天</title>\n\n<meta name=\"description\" content=\"What is API Test? 我們可以把它想成Unit Test單元測試的一種,不過它所涵蓋的最好集合不像以往的UnitTest可能以Function為主,而是Endpoint。 透過API T...\"/>\n<meta name=\"keywords\" content=\"iT邦幫忙,iThome\">\n<meta name=\"author\" content=\"iThome\">\n<meta property=\"og:site_name\" content=\"iT 邦幫忙::一起幫忙解決難題,拯救 IT 人的一天\"/>\n<meta property=\"og:url\" content=\"https://ithelp.ithome.com.tw/articles/10279931\"/>\n<meta property=\"og:type\" content=\"website\"/>\n<meta property=\"og:title\" content=\"Day25 Gin with API Test - iT 邦幫忙::一起幫忙解決難題,拯救 IT 人的一天\"/>\n<meta property=\"og:image\" content=\"https://ithelp.ithome.com.tw/upload/images/20211010/20129737oKVtf3CBHN.png\"/>\n<meta property=\"og:description\" content=\"What is API Test? 
我們可以把它想成Unit Test單元測試的一種,不過它所涵蓋的最好集合不像以往的UnitTest可能以Function為主,而是Endpoint。 透過API T...\"/>\n<meta property=\"fb:app_id\" content=\"137875859607921\" />\n\n<link rel=\"apple-touch-icon\" sizes=\"57x57\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-57x57.png\">\n<link rel=\"apple-touch-icon\" sizes=\"60x60\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-60x60.png\">\n<link rel=\"apple-touch-icon\" sizes=\"72x72\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-72x72.png\">\n<link rel=\"apple-touch-icon\" sizes=\"76x76\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-76x76.png\">\n<link rel=\"apple-touch-icon\" sizes=\"114x114\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-114x114.png\">\n<link rel=\"apple-touch-icon\" sizes=\"120x120\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-120x120.png\">\n<link rel=\"apple-touch-icon\" sizes=\"144x144\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-144x144.png\">\n<link rel=\"apple-touch-icon\" sizes=\"152x152\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-152x152.png\">\n<link rel=\"apple-touch-icon\" sizes=\"180x180\" href=\"https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-180x180.png\">\n<link rel=\"icon\" type=\"image/png\" href=\"https://ithelp.ithome.com.tw/storage/favicons/favicon-32x32.png\" sizes=\"32x32\">\n<link rel=\"icon\" type=\"image/png\"
...
href=\"https://ithelp.ithome.com.tw/storage/favicons/android-chrome-192x192.png\" sizes=\"192x192\">\n<link rel=\"icon\" type=\"image/png\" v>\n                                <div><a href=\"#\" class=\"invitation-list__account\">{{ result.account }}</a>\n                                </div>\n                            </div>\n                        </li>\n                    </ul>\n                </div>\n                <div class=\"modal-footer\">\n                    <a type=\"button\" class=\"btn btn-main\" data-dismiss=\"modal\">關閉</a>\n                </div>\n            </div>\n        </div>\n    </div>\n    </body>\n\n</html>" name=Collier
time="101010-10-10 1010:1010:1010" level=info msg="Finishedhttps://ithelp.ithome.com.tw/articles/10279931" name=Collier
time="101010-10-10 1010:1010:1010" level=info msg="| 200 |    623.3349ms |      172.19.0.1 | GET | /crawler |"

In the end we can find the record of our crawl in the log!
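Note that the log above reads "Visitinghttps://..." with no separator. logrus's Info joins its arguments the way fmt.Sprint does, and fmt.Sprint only inserts a space between two operands when neither of them is a string. The sketch below reproduces that behaviour with the standard library alone; `joinLikeSprint` is a hypothetical helper name used for illustration.

```go
package main

import (
	"fmt"
	"net/url"
)

// joinLikeSprint joins its arguments with fmt.Sprint, the same rule Info applies.
func joinLikeSprint(args ...interface{}) string {
	return fmt.Sprint(args...)
}

func main() {
	u, err := url.Parse("https://ithelp.ithome.com.tw/articles/10279931")
	if err != nil {
		panic(err)
	}
	// A string operand sits next to u, so no space is added.
	fmt.Println(joinLikeSprint("Visiting", u))
	// prints "Visitinghttps://ithelp.ithome.com.tw/articles/10279931"

	// Putting the separator in the message itself gives a readable line.
	fmt.Println(joinLikeSprint("Visiting: ", u))
	// prints "Visiting: https://ithelp.ithome.com.tw/articles/10279931"
}
```

Writing the message as "Visiting: " (or passing the URL as a logrus field) keeps the log line readable.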

Dig deeper

Next, let's try a more demanding crawl!

First, take a look at the structure of the page we crawled above

(https://ithelp.ithome.com.tw/users/20129737/ironman/4014)

<!DOCTYPE html>
<html lang="zh-TW">

<head>
    <meta charset="utf-8">
<meta http-equiv="X-UA-Compatible" content="IE=edge">
<meta name="viewport" content="width=device-width, initial-scale=1">

<title>fmt.Println(&quot;從零開始的Golang生活&quot;) :: 2021 iThome 鐵人賽</title>

<meta name="description" content="講述一位Python Developer如何從零開始學習Go,並透過該角度進行解析。"/>
<meta name="keywords" content="iT邦幫忙,iThome">
<meta name="author" content="iThome">
<meta property="og:site_name" content="iT 邦幫忙::一起幫忙解決難題,拯救 IT 人的一天"/>
<meta property="og:url" content="https://ithelp.ithome.com.tw/users/20129737/ironman/4014"/>
<meta property="og:type" content="website"/>
<meta property="og:title" content="fmt.Println(&quot;從零開始的Golang生活&quot;) :: 2021 iThome 鐵人賽"/>
<meta property="og:image" content="https://ithelp.ithome.com.tw/images/ironman/13th/fb.jpg"/>
<meta property="og:description" content="講述一位Python Developer如何從零開始學習Go,並透過該角度進行解析。"/>
<meta property="fb:app_id" content="137875859607921" />

<link rel="apple-touch-icon" sizes="57x57" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-57x57.png">
<link rel="apple-touch-icon" sizes="60x60" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-60x60.png">
<link rel="apple-touch-icon" sizes="72x72" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-72x72.png">
<link rel="apple-touch-icon" sizes="76x76" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-76x76.png">
<link rel="apple-touch-icon" sizes="114x114" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-114x114.png">
<link rel="apple-touch-icon" sizes="120x120" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-120x120.png">
<link rel="apple-touch-icon" sizes="144x144" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-144x144.png">
<link rel="apple-touch-icon" sizes="152x152" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-152x152.png">
<link rel="apple-touch-icon" sizes="180x180" href="https://ithelp.ithome.com.tw/storage/favicons/apple-touch-icon-180x180.png">
<link rel="icon" type="image/png" href="https://ithelp.ithome.com.tw/storage/favicons/favicon-32x32.png" sizes="32x32">
<link rel="icon" type="image/png" href="https://ithelp.ithome.com.tw/storage/favicons/android-chrome-192x192.png" sizes="192x192">
<link rel="icon" type="image/png" href="https://ithelp.ithome.com.tw/storage/favicons/favicon-96x96.png" sizes="96x96">
<link rel="icon" type="image/png" href="https://ithelp.ithome.com.tw/storage/favicons/favicon-16x16.png" sizes="16x16">
<link rel="manifest" href="https://ithelp.ithome.com.tw/storage/favicons/manifest.json">
<link rel="mask-icon" href="https://ithelp.ithome.com.tw/storage/favicons/safari-pinned-tab.svg" color="#5bbad5">
<meta name="msapplication-TileColor" content="#da532c">
<meta name="msapplication-TileImage" content="https://ithelp.ithome.com.tw/storage/favicons/mstile-144x144.png">
<meta name="theme-color" content="#ffffff">

<link rel="stylesheet" href="https://ithelp.ithome.com.tw/css/bootstrap.min.css">
<link rel="stylesheet" href="//ajax.googleapis.com/ajax/libs/jqueryui/1.11.3/themes/smoothness/jquery-ui.css"/>
<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/font-awesome/4.5.0/css/font-awesome.min.css">
<link rel="stylesheet" href="https://fonts.googleapis.com/css?family=Lato:400,700">
<link rel="stylesheet" href="//cdn.jsdelivr.net/simplemde/latest/simplemde.min.css">
<link rel="stylesheet" href="https://ithelp.ithome.com.tw/css/sweetalert.css">
<link rel="stylesheet" href="https://ithelp.ithome.com.tw/lib/select2/css/select2.min.css">
<link rel="stylesheet" href="https://ithelp.ithome.com.tw/css/google.css">
<link rel="stylesheet" href="https://ithelp.ithome.com.tw/css/style.css?202008271142">
<!-- highlight -->
<link rel="stylesheet" href="https://ithelp.ithome.com.tw/css/railscasts.css">
<!-- end -->
<!-- HTML5 shim and Respond.js for IE8 support of HTML5 elements and media queries -->
<!-- WARNING: Respond.js doesn't work if you view the page via file:// -->
<!--[if lt IE 9]>
<script src="https://oss.maxcdn.com/html5shiv/3.7.2/html5shiv.min.js"></script>
<script src="https://oss.maxcdn.com/respond/1.4.2/respond.min.js"></script>
<![endif]-->
<!--messenger css-->
    </head>

<body>
    <div class="header">
    <div class="header__inner clearfix">
        <h1 class="header__logo pull-left"><a href="/"><img src="https://ithelp.ithome.com.tw/storage/image/logo.svg" alt="iT邦幫忙" class="img-responsive"></a></h1>
        <div class="header__promote">
            <div class="a12word pull-right">
                <div class="a12word__box">
                    <script type="text/javascript" src="https://itadapi.ithome.com.tw/media/serve?type=T2&channel=ithome_forum&encoding=Utf8"> </script>
                </div>
                <div class="a12word__box">
                    <script type="text/javascript" src="https://itadapi.ithome.com.tw/media/serve?type=T3&channel=ithome_forum&encoding=Utf8"> </script>
                </div>
                <div class="a12word__box">
                    <script type="text/javascript" src="https://itadapi.ithome.com.tw/media/serve?type=T4&channel=ithome_forum&encoding=Utf8"> </script>
                </div>
            </div>

            <div class="a970 pull-right">
                <script src="https://itadapi.ithome.com.tw/media/serve?type=B1&channel=ithome_forum&encoding=Utf8"></script>
            </div>
        </div>
    </div>
.............
    </body>

</html>

If we only want to pick out the title of each Ironman Contest article, we can see that the titles always sit under

<body><div class="board leftside profile-main"><div class="ir-profile-content"><div class="profile-list__content">

Inside each <div class="profile-list__content">, the title sits at

<h3 class="qa-list__title"><a class="qa-list__title-link"> title </a>

...
<body>
	...
	<div class="board leftside profile-main">
		<div class="ir-profile-content">
			...
			<div class="profile-list__content">
				...
				<h3 class="qa-list__title">
					<a href="https://ithelp.ithome.com.tw/articles/10267570" class="qa-list__title-link">
						Day4 Variable
					</a>
				</h3>
				...
			</div>
			...
		</div>
		...
...

So we parse the page with XPath (through the htmlquery package) to get the titles we want

app/crawler/collier.go


// Needs "strings" and "github.com/antchfx/htmlquery" in the import block.
c.OnResponse(func(r *colly.Response) {
	doc, err := htmlquery.Parse(strings.NewReader(string(r.Body)))
	if err != nil {
		// Log and skip this page; Fatal would exit the whole server.
		middleware.Logger().WithFields(logrus.Fields{
			"name": "Collier",
		}).Error("Visited failed, error: ", err)
		return
	}
	// One node per article entry in the list.
	entries := htmlquery.Find(doc, `//div[@class="board leftside profile-main"]//div[@class="ir-profile-content"]//div[@class="profile-list__content"]`)
	for _, node := range entries {
		// The leading "." keeps the search inside the current entry.
		title := htmlquery.FindOne(node, `.//h3[@class="qa-list__title"]//a[@class="qa-list__title-link"]/text()`)
		if title == nil {
			continue // entry without a title link
		}
		middleware.Logger().WithFields(logrus.Fields{
			"name": "Collier",
		}).Info("Title: ", htmlquery.InnerText(title))
	}
})
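  • When the Response arrives, we first walk down the path to find every div[@class="profile-list__content"].
  • Then a for loop looks up the title inside each of those nodes.
  • Finally, htmlquery.InnerText() turns the node into a string, which we write to the log.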

The data written to the log then looks like this:

time="111110-10-10 1010:1010:1010" level=info msg="Visiting: https://ithelp.ithome.com.tw/users/20129737/ironman/4014" name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n                                                            Day1 Why Go?\n                                                    " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n                                                            Day2 Develop Environment For Go\n                                                    " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n                                                            Day3 First Go application\n                                                    " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n                                                            Day4 Variable\n                                                    " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n                                                            Day5 Type\n                                                    " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n                                                            Day6 Array and Slice\n                                                    " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n                                                            Day7 Map and Struct\n                                                    " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n                                                            Day8 Function and Interface\n                                                    " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n                                                            Day9 Goroutine\n                                                    " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Title: \n                                                            Day10 Sync.WaitGroup & Sync.Map\n                                                    " name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="Finished: https://ithelp.ithome.com.tw/users/20129737/ironman/4014" name=Collier
time="111110-10-10 1010:1010:1010" level=info msg="| 200 |    692.5716ms |    192.168.16.1 | GET | /crawler |"
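The titles in the log above still carry the newlines and indentation of the HTML source. A minimal sketch of cleaning them before logging, using only the standard library; `cleanTitle` is a hypothetical helper, not part of the original code.

```go
package main

import (
	"fmt"
	"strings"
)

// cleanTitle trims surrounding whitespace and collapses inner runs of
// whitespace (spaces, tabs, newlines) into single spaces.
func cleanTitle(raw string) string {
	return strings.Join(strings.Fields(raw), " ")
}

func main() {
	raw := "\n        Day10 Sync.WaitGroup & Sync.Map\n      "
	fmt.Printf("%q\n", cleanTitle(raw)) // prints "Day10 Sync.WaitGroup & Sync.Map"
}
```

In the callback this would wrap the extracted text, for example `cleanTitle(htmlquery.InnerText(title))`.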

Summary

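In this chapter we used Colly to crawl an Ironman Contest page and print every title on it. The next time we need to crawl a specific domain or data set, we are no longer limited to Python; Go is a good choice too!

The code for this chapter is available at the link below for reference.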

https://github.com/Neskem/Ironman-2021/tree/Day-27

